Processing Informal, Romanized Pakistani Text Messages

Authors

  • Ann Irvine
  • Jonathan Weese
  • Chris Callison-Burch
Abstract

Regardless of language, the standard character set for text messages (SMS) and many other social media platforms is the Roman alphabet. There are romanization conventions for some character sets, but they are used inconsistently in informal text, such as SMS. In this work, we convert informal, romanized Urdu messages into the native Arabic script and normalize non-standard SMS language. Doing so prepares the messages for existing downstream processing tools, such as machine translation, which are typically trained on well-formed, native script text. Our model combines information at the word and character levels, allowing it to handle out-of-vocabulary items. Compared with a baseline deterministic approach, our system reduces both word and character error rate by over 50%.
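The word and character error rates mentioned above can be sketched as follows. This is a minimal illustration that assumes the common definition (Levenshtein edit distance divided by reference length); the abstract does not state the exact formula used, and the function names here are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def word_error_rate(ref, hyp):
    # Assumption: words are whitespace-delimited tokens.
    return error_rate(ref.split(), hyp.split())


def char_error_rate(ref, hyp):
    # Assumption: each Unicode code point counts as one character.
    return error_rate(list(ref), list(hyp))


if __name__ == "__main__":
    # Illustrative strings only, not data from the paper.
    reference = "آپ کیسے ہیں"
    hypothesis = "آپ کیسا ہیں"
    print(word_error_rate(reference, hypothesis))
    print(char_error_rate(reference, hypothesis))
```

Under this definition, a 50% reduction means the system's normalized edit distance to the reference script is half that of the baseline, at the word level and at the character level respectively.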

Similar resources

Romanized Berber and Romanized Arabic Automatic Language Identification Using Machine Learning

Identifying the language of text or speech input is the first step in any language-dependent natural language processing. The task is called Automatic Language Identification (ALI). The field has been well studied since the early 1960s, and various methods have been applied to many standard languages. The ALI standard methods require datasets for training and use character/wor...

Romanized Language Identification and Transliteration System for Security with an Authentication System Using Persuasive Cued Click Points - RLITS

Romanized script is popular today for communication in every country, as the script is almost universally enabled in text processors. In countries like India, a linguistic cauldron, it is very common to see English text in email messages and chat transcripts with a generous sprinkling of words from local languages in Roman script. Dubbed as Manglish (Malayalam and English) etc., this rom...

Addressing challenges in automatic Language Identification of Romanized Text

Due to the diversity of documents on the web, language identification is a vital task for web search engines during crawling and indexing of web documents. Among the current challenges in language identification, identifying the language of Romanized text remains an unsettled problem. The challenge in Romanized text is the variation in word spellings and sounds across dialects. We propose a Romani...

Compressing Semi-Structured Text Using Hierarchical Phrase Identifications

Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable, such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a ...

Compressing Semi-Structured Text using Hierarchical Phrase Identification

Many computer files contain highly-structured, predictable information interspersed with information which has less regularity and is therefore less predictable—such as free text. Examples range from word-processing source files, which contain precisely-expressed formatting specifications enclosing tracts of natural-language text, to files containing a sequence of filled-out forms which have a ...

Publication date: 2012